ATMDP-007: Environmental Data Science
2024-01-24
What is an IDE?
Which R IDE should I use?
RStudio: RStudio is widely recognized as the most popular IDE for R programming. It’s a free, open-source IDE providing an extensive range of features like a code editor, debugger, console, and R Markdown support.
Jupyter Notebook: This web-based IDE is known for its data science and machine learning capabilities. It allows creating and sharing interactive documents with code, text, and visualizations.
Visual Studio Code: A free and open-source editor, VS Code is a popular choice for R programming. It offers features like syntax highlighting, code completion, and debugging.
ESS (Emacs Speaks Statistics): Combines the Emacs text editor with ESS package to provide R features like syntax highlighting and debugging.
Eclipse StatET: Integrates Eclipse IDE with the StatET plugin, offering features like a code editor and debugger for R programming.
Sublime Text: A lightweight and robust code editor for R programming, offering features like syntax highlighting and code completion.
RKWard: A free and open-source IDE designed specifically for R, with a user-friendly interface and features like a code editor and debugger.
PyCharm: Known for Python programming, PyCharm also supports R programming, offering advanced features.
What is the tidyverse?
The tidyverse is a collection of R packages designed for data science that share an underlying design philosophy, grammar, and data structures. Together, these packages provide a coherent toolkit for data manipulation, exploration, and visualization that is designed to make data science faster and easier.
ggplot2: For data visualization, using a layered grammar of graphics.
dplyr: For data manipulation, such as filtering rows, selecting columns, and summarizing data.
tidyr: For tidying data, changing the layout of datasets to a tidy format.
readr: For importing data, particularly from CSV and similar flat file formats.
purrr: For functional programming, enabling operations on lists and vectors.
tibble: For modern reimagining of data frames, keeping things simple and tidy.
stringr: For string manipulation and regular expressions.
forcats: For handling categorical variables (factors).
dbplyr: A database backend for dplyr, allowing dplyr syntax to be used to manipulate data stored in a relational database by translating dplyr code into SQL.
dtplyr: A data.table backend for dplyr, enabling the use of dplyr syntax while leveraging the speed and efficiency of data.table for large datasets.
magrittr: Provides the pipe operator %>%, which allows for cleaner and more readable code by chaining commands in a sequence of data operations.
Installing tidyverse is easy: we can do it directly from the R console with install.packages("tidyverse"), which downloads the suite of packages stored on CRAN.
Once tidyverse is installed on our computer, we must load it with library(tidyverse):
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.3 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Note that some conflicts with other packages might emerge (this is not due to tidyverse, but rather a particularity of R).
This happens because more than one function with the same name is loaded in our R environment.
One way around this is to write the package name before the function, separated by ::, for example dplyr::filter() instead of filter().
tidyverse offers a significant advantage over base R primarily through its consistent and user-friendly syntax, making data manipulation and analysis more intuitive and accessible, especially for beginners.
Its collection of packages is designed to work together seamlessly, streamlining workflows in data science.
This integration reduces the learning curve and enhances productivity, allowing for more readable and maintainable code.
tidyverse’s emphasis on tidy data principles aids in creating more organized and understandable data structures, facilitating easier data analysis and visualization.
small parenthesis
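Tidy data is a concept and format in data preparation that simplifies data analysis in statistics and data science. It adheres to three main principles: each variable forms a column, each observation forms a row, and each value has its own cell.
Tidy layout (one column per variable):
| Sample ID | Treatment | Soil Temperature |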
|---|---|---|
| HYY-C-2024-01 | Control | 2 |
| HYY-C-2024-02 | Control | 3 |
| HYY-C-2024-03 | Drought | 6 |
| HYY-C-2024-04 | Drought | 7 |
Untidy layout (the treatment levels are spread across columns):
| Sample ID | Control | Drought |
|---|---|---|
| HYY-C-2024-01 | 2 | NA |
| HYY-C-2024-02 | 3 | NA |
| HYY-C-2024-03 | NA | 6 |
| HYY-C-2024-04 | NA | 7 |
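As a minimal sketch of how the second table above could be brought into the tidy layout of the first, the wide table can be reshaped with tidyr::pivot_longer(). The object and column names used here (soil_wide, soil_tidy, sample_id, treatment, soil_temperature) are chosen for illustration and are not part of the original material.

```r
library(tibble)
library(tidyr)

# Wide layout: one column per treatment level (second table above)
soil_wide <- tribble(
  ~sample_id,      ~Control, ~Drought,
  "HYY-C-2024-01", 2,        NA,
  "HYY-C-2024-02", 3,        NA,
  "HYY-C-2024-03", NA,       6,
  "HYY-C-2024-04", NA,       7
)

# Tidy layout: one row per sample, one column per variable
soil_tidy <- pivot_longer(
  soil_wide,
  cols = c(Control, Drought),
  names_to = "treatment",
  values_to = "soil_temperature",
  values_drop_na = TRUE # drop the empty treatment/sample combinations
)

soil_tidy
```

The result has four rows and three columns, matching the first table: every variable sits in its own column and no cell is left empty by the layout itself.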
tidyverse syntax
The magrittr package and its pipe operator %>% play a crucial role in the tidyverse syntax by enabling a more intuitive and readable flow of data manipulation steps.
This operator allows for chaining together functions in a sequence, transforming the data step-by-step.
This approach not only enhances readability but also simplifies the process of writing and understanding complex data transformations.
Sequential Operations: The pipe operator allows for a sequence of operations to be chained together. This leads to code that reads more like a series of steps, which aligns closely with the way we logically think about data processing tasks.
Reduction in Nesting: Without the pipe operator, functions are nested inside each other, which can make code difficult to read and understand. The pipe operator reduces this nesting, making the code cleaner and more straightforward.
Modifying Code: When using the pipe operator, it’s easier to add, remove, or change steps in our data processing pipeline. This flexibility makes debugging and maintaining code simpler.
Troubleshooting: We can insert a breakpoint or a diagnostic function at any point in the pipeline to inspect intermediate results, which helps in troubleshooting.
Modular Approach: The pipe operator encourages a modular approach to code writing. Each step in the pipeline does one thing, which is a good programming practice. This modularity also makes the code more reusable.
Focus on Data Flow: The pipe operator emphasizes the flow of data through a series of transformations, which aligns well with many data analysis tasks.
Consistency with Tidyverse: The pipe operator is part of the tidyverse’s coherent and consistent approach to data science. It works seamlessly with other tidyverse packages (like dplyr, tidyr ), which are designed to work with pipe-friendly syntax.
Functional Style: The pipe operator supports a more functional style of programming, where the focus is on the transformation of data rather than the manipulation of state.
Improved learning curve for beginners:
Intuitive for New Users: For those new to R, the pipe operator can make learning easier. The clear, step-by-step nature of piped commands is often more intuitive than nested function calls.
Alignment with Natural Language: The pipe operator’s syntax is somewhat analogous to natural language (“Take this data, then do this, then do that”), which can be easier for beginners to grasp.
x %>% f is equivalent to f(x)
x %>% f(y) is equivalent to f(x, y)
x %>% f %>% g %>% h is equivalent to h(g(f(x)))
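The three equivalences above can be checked directly. This is a small sketch using only magrittr and base R functions; the vector x and the functions chosen (sqrt, head, sum, log) are arbitrary illustrations.

```r
library(magrittr)

x <- c(1, 4, 9, 16)

# x %>% f  is equivalent to  f(x)
piped_sqrt  <- x %>% sqrt
nested_sqrt <- sqrt(x)

# x %>% f(y)  is equivalent to  f(x, y)
piped_head  <- x %>% head(2)
nested_head <- head(x, 2)

# x %>% f %>% g %>% h  is equivalent to  h(g(f(x)))
piped_chain  <- x %>% sqrt %>% sum %>% log
nested_chain <- log(sum(sqrt(x)))

# Each comparison returns TRUE
identical(piped_sqrt, nested_sqrt)
identical(piped_head, nested_head)
identical(piped_chain, nested_chain)
```

The piped versions read left to right in the order the operations happen, while the nested versions have to be read from the inside out.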
The starwars dataset, shipped with dplyr, printed as a tibble:
# A tibble: 87 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth V… 202 136 none white yellow 41.9 male mascu…
5 Leia Or… 150 49 brown light brown 19 fema… femin…
6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
7 Beru Wh… 165 75 brown light blue 47 fema… femin…
8 R5-D4 97 32 <NA> white, red red NA none mascu…
9 Biggs D… 183 84 black light brown 24 male mascu…
10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
Output of starwars %>% glimpse():
Rows: 87
Columns: 14
$ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
$ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
$ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
$ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
$ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
$ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
$ sex <chr> "male", "none", "none", "male", "female", "male", "female",…
$ gender <chr> "masculine", "masculine", "masculine", "masculine", "femini…
$ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
$ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
$ films <list> <"A New Hope", "The Empire Strikes Back", "Return of the J…
$ vehicles <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
$ starships <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…
Without the pipe, the function (glimpse()) comes first, and the data object (starwars) is nested in it.
Rows: 87
Columns: 14
$ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
$ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
$ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
$ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
$ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
$ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
$ sex <chr> "male", "none", "none", "male", "female", "male", "female",…
$ gender <chr> "masculine", "masculine", "masculine", "masculine", "femini…
$ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
$ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
$ films <list> <"A New Hope", "The Empire Strikes Back", "Return of the J…
$ vehicles <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
$ starships <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…
Useful tip
We can type the pipe operator (%>%) quickly in RStudio with the keyboard shortcut:
Ctrl + Shift + M (Cmd + Shift + M on macOS)
tidyverse core packages
Foundation for data science
tibble: modern take on data frames, ensuring ease of use and compatibility with the tidyverse ecosystem.
dplyr: key for data manipulation tasks, offering intuitive functions for filtering, sorting, and summarizing data efficiently.
tidyr: essential for data tidying, transforming datasets into a structured, readable format.
Why start with these?
Practical relevance: these packages address the most common data processing tasks - organizing, transforming, and summarizing data.
Easy to learn: mastering these packages provides a strong foundation, making it easier to understand and use other tidyverse packages.
Immediate application: skills in tibble, dplyr, and tidyr are immediately applicable in a wide range of data processing and analysis scenarios.
Building a Strong Base
Focusing on these packages first provides understanding of essential tools needed for most data processing tasks.
Encourages a smoother transition to more complex aspects of data analysis in the tidyverse.
tibble
Why tibble?
Enhanced Data Frames: tibbles are an evolution of the traditional data frame in R, offering a more modern, tidyverse-compatible structure.
User-Friendly: easier to use and understand, especially for those new to R.
tibble vs. data.frame:
Printing: tibbles print only a small subset of the data (by default the first 10 rows and the columns that fit on screen), making them more manageable with large datasets.
Data type preservation: tibbles never convert character vectors to factors. data.frame() did so by default before R 4.0.0, and still does when stringsAsFactors = TRUE.
Subsetting behavior: subsetting a tibble with [ always returns a tibble, whereas a data.frame can change structure depending on the subset.
Row names: tibbles do not use row names, which simplifies their structure and avoids some common data manipulation errors.
Column subsetting: selecting a single column with [ still returns a tibble, unlike a data.frame, which returns a vector by default.
Non-syntactic names: tibbles keep non-syntactic column names (for example, names containing spaces) exactly as given, whereas data.frame() rewrites them by default. Backticks are still needed to refer to such columns in code.
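A few of these differences can be seen in a short sketch. The objects tb and df and the column name `soil temp` are made up for illustration.

```r
library(tibble)

tb <- tibble(x = 1:3, y = c("a", "b", "c"))
df <- as.data.frame(tb)

# Single-column subsetting with [ : the data.frame drops to a vector,
# the tibble stays a tibble
df_col <- df[, "x"]
tb_col <- tb[, "x"]
is.vector(df_col)
is_tibble(tb_col)

# Character columns stay character in a tibble
class(tb$y)

# Non-syntactic names: data.frame() rewrites them by default, tibble() keeps them
df_names <- names(data.frame(`soil temp` = 1))
tb_names <- names(tibble(`soil temp` = 1))
df_names
tb_names
```

Here data.frame() turns `soil temp` into soil.temp, while tibble() keeps the name unchanged, so it has to be written with backticks whenever it is used in code.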
A few relevant tibble functions
as_tibble(): Converts existing data structures into tibbles.
tibble(): Creates tibble data frames directly.
tribble(): Allows for easy creation of tibbles with a readable layout.
add_row(): Adds rows to an existing tibble.
add_column(): Adds columns to an existing tibble.
Note that there are more functions in the tibble package that might be relevant to our interests.
Using the pipe operator (%>%) to connect processing steps, we can transform a dataset into a tibble and add a new row in one go.
library(tidyverse)
# Convert iris to a tibble and add a new row using the pipe operator
iris %>% # this is a classic default dataset in R
  as_tibble() %>% # transforming the data.frame to a tibble
  add_row(Sepal.Length = 5.5,
          Sepal.Width = 3,
          Petal.Length = 1.2,
          Petal.Width = 0.1,
          Species = "cherry") %>% # adding a new row to the dataset
  tail() # to check the row added
# A tibble: 6 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <chr>
1 6.7 3 5.2 2.3 virginica
2 6.3 2.5 5 1.9 virginica
3 6.5 3 5.2 2 virginica
4 6.2 3.4 5.4 2.3 virginica
5 5.9 3 5.1 1.8 virginica
6 5.5 3 1.2 0.1 cherry
Otherwise, we could use base R functions to do the same job.
# Convert iris to a data frame (it's already a data frame, so this step is more about clarity)
iris_df <- data.frame(iris)
# Extend the levels of the Species factor to include "cherry"
iris_df$Species <- factor(iris_df$Species, levels = c(levels(iris_df$Species), "cherry"))
# Adding a new row to the data frame
new_row <- data.frame(Sepal.Length = 5.5,
Sepal.Width = 3,
Petal.Length = 1.2,
Petal.Width = 0.1,
Species = "cherry")
iris_df <- rbind(iris_df, new_row)
# Display the last few rows of the data frame to check the row added
tail(iris_df)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
146 6.7 3.0 5.2 2.3 virginica
147 6.3 2.5 5.0 1.9 virginica
148 6.5 3.0 5.2 2.0 virginica
149 6.2 3.4 5.4 2.3 virginica
150 5.9 3.0 5.1 1.8 virginica
151 5.5 3.0 1.2 0.1 cherry
Note that the code is more complex (e.g. adding the level to Species) and the output does not show the data classes (e.g. dbl, chr, fct).
Creating a tibble from scratch using tribble()
easy step-by-step dataset building
entries are row by row oriented
good for small datasets
columns are identified by ~ and separated by ,
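A minimal sketch of the row-by-row layout described above. The dataset soil_samples and its values are hypothetical and only serve to show the syntax.

```r
library(tibble)

# Hypothetical soil samples, written one observation per line
soil_samples <- tribble(
  ~sample_id, ~treatment, ~soil_temperature,
  "S01",      "Control",  2,
  "S02",      "Control",  3,
  "S03",      "Drought",  6
)

soil_samples
```

Each line of the call corresponds to one row of the resulting tibble, so the code can be read like the table it creates.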
Creating a data.frame from scratch using base R.
the information is not row by row oriented
not easy to follow in slightly larger datasets
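For comparison, a sketch of a small data.frame built column by column in base R. The dataset soil_samples_df and its values are hypothetical.

```r
# Hypothetical soil samples, written one variable (column) at a time
soil_samples_df <- data.frame(
  sample_id        = c("S01", "S02", "S03"),
  treatment        = c("Control", "Control", "Drought"),
  soil_temperature = c(2, 3, 6)
)

# To read one observation in the code above, the same position has to be
# looked up in every vector; here it is extracted after creation instead
third_sample <- soil_samples_df[3, ]
third_sample
```

With three rows this is still easy to follow, but matching positions across vectors becomes error-prone as the number of rows grows.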
Adding rows and columns to a tibble
add_row(): adds rows to the dataset
add_column(): adds columns to the dataset
library(tibble)
felines <- tribble(
~species, ~weight, ~length,
"lion", 190, 2.4,
"tiger", 220, 2.5,
"jaguar", 100, 2.0
)
felines %>%
  add_row(species = "leopard", # adding a row for the leopard
          weight = 90,
          length = 2.1) %>%
  add_column(scientific_name = c("Panthera leo", "Panthera tigris", "Panthera onca", "Panthera pardus"), # adding a column for the scientific name
             .after = "species") # specifying where the new column will be added
# A tibble: 4 × 4
species scientific_name weight length
<chr> <chr> <dbl> <dbl>
1 lion Panthera leo 190 2.4
2 tiger Panthera tigris 220 2.5
3 jaguar Panthera onca 100 2
4 leopard Panthera pardus 90 2.1
The same steps using base R.
# Creating a data.frame
felines <- data.frame(species = c("lion", "tiger", "jaguar"),
                      weight = c(190, 220, 100),
                      length = c(2.4, 2.5, 2.0))
# Adding a row for the leopard (as a one-row data.frame, so weight and length stay numeric;
# binding a plain c("leopard", 90, 2.1) would turn both columns into character)
felines <- rbind(felines, data.frame(species = "leopard", weight = 90, length = 2.1))
# Adding a column for the scientific name
felines$scientific_name <- c("Panthera leo", "Panthera tigris", "Panthera onca", "Panthera pardus")
# Reordering columns to place scientific_name after species
felines <- felines[c("species", "scientific_name", "weight", "length")]
felines
species scientific_name weight length
1 lion Panthera leo 190 2.4
2 tiger Panthera tigris 220 2.5
3 jaguar Panthera onca 100 2.0
4 leopard Panthera pardus 90 2.1
dplyr
dplyr empowers users to efficiently handle and transform data, making it a vital tool for any R data processing and manipulation task.
Probably the most used, and perhaps the most important, of all tidyverse core packages.
Some relevant dplyr functions:
filter(): Extracts rows based on specified conditions.
select(): Chooses columns, simplifying dataset structure.
rename(): Changes the names of individual variables.
mutate(): Creates or transforms variables, enhancing data with new insights.
group_by(): Facilitates grouped calculations, enhancing data analysis scope.
summarise(): Aggregates data, ideal for generating summaries.
*_join(): Merges two datasets based on a common key. In practice it is used as full_join(), inner_join(), left_join() and right_join().
arrange(): Orders rows by variable values.
distinct(): Keeps only unique/distinct rows from a data frame.
relocate(): Changes the order of the columns in the dataset.
if_else(): A vectorized if-else function. Similar to, but more rigorous than, ifelse().
case_when(): A vectorized set of if-else statements.
*Note that other functions from dplyr might be more relevant depending on our specific needs.
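As a small illustration of what "more rigorous" means for if_else() in the list above, the sketch below compares it with base ifelse(). The vector wind, the threshold of 96 and the labels are arbitrary choices made for this example.

```r
library(dplyr)

wind <- c(50, 120, NA)

# Both functions return the same labels when the two branches share a type
base_result  <- ifelse(wind >= 96, "strong", "weak")
dplyr_result <- if_else(wind >= 96, "strong", "weak")

# if_else() can also label missing values explicitly
dplyr_missing <- if_else(wind >= 96, "strong", "weak", missing = "unknown")

# ifelse() silently coerces mixed branch types to character ...
mixed_base <- ifelse(wind >= 96, "strong", 0)

# ... whereas if_else() refuses to combine a character and a number
mixed_dplyr <- tryCatch(
  if_else(wind >= 96, "strong", 0),
  error = function(e) "error: incompatible types"
)

mixed_base
mixed_dplyr
```

The stricter behaviour means that a type mismatch between the two branches is reported immediately instead of propagating as an unexpected character column.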
With dplyr it is possible to chain a series of data processing steps without saving intermediate objects in R. Each step is connected to the next with the pipe operator %>%, and the code is easy to follow.
library(tidyverse)
# Using the storms dataset
storms %>%
  filter(year >= 2020) %>% # Keep only storms from 2020 onwards
  select(name, year, status, wind) %>% # Choose specific columns
  group_by(name, year) %>% # Group by name and year
  summarise(average_wind = mean(wind, na.rm = TRUE), .groups = "drop") %>% # Calculate the average wind speed
  rename(average_wind_speed = average_wind) %>% # Change 'average_wind' to 'average_wind_speed'
  arrange(desc(average_wind_speed)) %>% # Order by wind speed in descending order
  relocate(year, .before = name) # Move 'year' column right before 'name'
# A tibble: 66 × 3
year name average_wind_speed
<dbl> <chr> <dbl>
1 2021 Sam 93.2
2 2021 Larry 82.2
3 2022 Ian 70.5
4 2020 Teddy 70.3
5 2022 Fiona 69.2
6 2020 Delta 67.7
7 2020 Iota 63.7
8 2020 Zeta 59.2
9 2020 Isaias 57.6
10 2020 Epsilon 56.3
# ℹ 56 more rows
It is also possible to achieve the same result using base R syntax and functions. However, the code is a bit less clear and it requires saving several intermediate objects in the R environment.
library(tidyverse) # to access the storms dataset
# Filtering storms from 2020 onwards and selecting specific columns
storms_filtered <- subset(storms, year >= 2020, select = c(name, year, status, wind))
# Grouping by name and year, and calculating the average wind speed
storms_aggregated <- aggregate(wind ~ name + year, data = storms_filtered,
FUN = function(x) mean(x, na.rm = TRUE))
# Renaming 'wind' to 'average_wind_speed'
names(storms_aggregated)[which(names(storms_aggregated) == "wind")] <- "average_wind_speed"
# Ordering by average wind speed in descending order
storms_ordered <- storms_aggregated[order(-storms_aggregated$average_wind_speed), ]
# Moving 'year' column right before 'name'
storms_ordered <- storms_ordered[c("year", "name", "average_wind_speed")]
# View the final result
storms_ordered
year name average_wind_speed
48 2021 Sam 93.22034
42 2021 Larry 82.17391
60 2022 Ian 70.50000
25 2020 Teddy 70.30612
57 2022 Fiona 69.18033
5 2020 Delta 67.74194
14 2020 Iota 63.65385
30 2020 Zeta 59.20000
15 2020 Isaias 57.63889
8 2020 Epsilon 56.34146
55 2022 Earl 55.57692
64 2022 Martin 55.23810
9 2020 Eta 54.82759
52 2022 Bonnie 54.27273
39 2021 Ida 53.62500
24 2020 Sally 52.32143
18 2020 Laura 52.26190
35 2021 Elsa 51.04651
37 2021 Grace 50.78947
13 2020 Hanna 50.27778
54 2022 Danielle 48.70968
22 2020 Paulette 48.01136
20 2020 Nana 46.92308
65 2022 Nicole 46.15385
38 2021 Henri 44.52381
61 2022 Julia 44.28571
1 2020 Arthur 43.61111
50 2021 Wanda 43.61111
63 2022 Lisa 42.70833
51 2022 Alex 42.64706
27 2020 Theta 42.12121
40 2021 Julian 41.92308
11 2020 Gamma 41.42857
19 2020 Marco 41.42857
58 2022 Gaston 41.17647
45 2021 Odette 39.41860
31 2021 Ana 39.41176
17 2020 Kyle 39.28571
49 2021 Victor 37.72727
46 2021 Peter 37.50000
4 2020 Cristobal 37.22222
12 2020 Gonzalo 37.17391
44 2021 Nicholas 36.66667
3 2020 Beta 36.02941
62 2022 Karl 35.83333
32 2021 Bill 35.50000
33 2021 Claudette 34.56522
43 2021 Mindy 34.50000
34 2021 Danny 33.57143
53 2022 Colin 33.33333
7 2020 Edouard 33.18182
16 2020 Josephine 33.12500
28 2020 Vicky 32.91667
29 2020 Wilfred 32.64706
47 2021 Rose 32.50000
36 2021 Fred 32.44444
2 2020 Bertha 32.14286
23 2020 Rene 31.87500
6 2020 Dolly 31.66667
21 2020 Omar 30.74074
41 2021 Kate 30.71429
66 2022 Twelve 30.00000
59 2022 Hermine 29.58333
10 2020 Fay 28.51852
26 2020 Ten 27.50000
56 2022 Eleven 25.83333
This code provides some more examples of how dplyr functions can be used for more advanced data processing, highlighting the package’s strengths in data manipulation.
library(tidyverse)
storms %>%
  # Create a new column 'wind_category' based on wind speed using case_when().
  # Note: these cut-offs are the Saffir-Simpson limits expressed in mph, while
  # storms$wind is recorded in knots, so the categories here are illustrative only.
  mutate(wind_category = case_when(
    wind < 74 ~ "Not a hurricane",
    wind >= 74 & wind < 96 ~ "Category 1",
    wind >= 96 & wind < 111 ~ "Category 2",
    wind >= 111 & wind < 130 ~ "Category 3",
    wind >= 130 & wind < 157 ~ "Category 4",
    TRUE ~ "Category 5"
  )) %>%
  # Create a new column 'major_hurricane' using if_else to identify major hurricanes (Category 3 and above)
  mutate(major_hurricane = if_else(condition = wind_category %in% c("Category 3", "Category 4", "Category 5"),
                                   true = "Yes", false = "No")) %>%
  filter(major_hurricane == "Yes") %>% # Keep only rows where 'major_hurricane' is "Yes"
  distinct(name, year, major_hurricane) %>% # Keep only distinct rows based on 'name', 'year', and 'major_hurricane'
  count(year, major_hurricane, sort = TRUE, name = "major_hurricane_per_year") %>% # Count the number of major hurricanes per year and sort the result
  select(-major_hurricane) # Remove the 'major_hurricane' column from the final result
# A tibble: 37 × 2
year major_hurricane_per_year
<dbl> <int>
1 1999 5
2 2005 5
3 2020 5
4 2004 4
5 2008 4
6 2010 4
7 2017 4
8 1988 3
9 1995 3
10 1978 2
# ℹ 27 more rows
Again, it is possible to achieve the same end goal using base R and stats functions, but the code is less straightforward and it saves intermediate data objects in the R environment.
library(tidyverse) # to have access to the storms dataset
# Create a new column 'wind_category' based on wind speed
storms$wind_category <- with(storms, ifelse(wind < 74, "Not a hurricane",
ifelse(wind < 96, "Category 1",
ifelse(wind < 111, "Category 2",
ifelse(wind < 130, "Category 3",
ifelse(wind < 157, "Category 4", "Category 5"))))))
# Create a new column 'major_hurricane' to identify major hurricanes
storms$major_hurricane <- ifelse(test = storms$wind_category %in% c("Category 3", "Category 4", "Category 5"),
yes = "Yes", no = "No")
storms_unique <- storms[!duplicated(storms[c("name", "year", "major_hurricane")]), ] # Keep only distinct rows based on 'name', 'year', and 'major_hurricane'
major_hurricanes <- storms_unique[storms_unique$major_hurricane == "Yes", ] # Filter to keep only major hurricanes
major_hurricane_count <- aggregate(cbind(major_hurricane_per_year = major_hurricanes$wind) ~ year, # Count the number of major hurricanes per year
data = major_hurricanes, FUN = length)
major_hurricane_count <- major_hurricane_count[order(-major_hurricane_count$major_hurricane_per_year), ] # Sorting by count in descending order
head(major_hurricane_count) # View the result
year major_hurricane_per_year
17 1999 5
23 2005 5
35 2020 5
22 2004 4
25 2008 4
27 2010 4
Joins with dplyr:
library(tidyverse)
# Create a small custom dataset using tribble
storm_categories <- tribble(
~category, ~description,
"Category 1", "Very dangerous winds",
"Category 2", "Extremely dangerous winds",
"Category 3", "Devastating damage",
"Category 4", "Catastrophic damage",
"Category 5", "High chance of being deadly")
storms <- storms %>% # Add a wind category to the storms dataset for joining
mutate(wind_category = case_when(
wind < 74 ~ "Not a hurricane",
wind >= 74 & wind < 96 ~ "Category 1",
wind >= 96 & wind < 111 ~ "Category 2",
wind >= 111 & wind < 130 ~ "Category 3",
wind >= 130 & wind < 157 ~ "Category 4",
TRUE ~ "Category 5"))
full_join_result <- full_join(storms, storm_categories, by = c("wind_category" = "category"))
full_join_result %>%
  select(name, year, wind_category, description) %>%
  slice(1:3)
# A tibble: 3 × 4
name year wind_category description
<chr> <dbl> <chr> <chr>
1 Amy 1975 Not a hurricane <NA>
2 Amy 1975 Not a hurricane <NA>
3 Amy 1975 Not a hurricane <NA>
inner_join_result <- inner_join(storms, storm_categories, by = c("wind_category" = "category"))
inner_join_result %>%
  select(name, year, wind_category, description) %>%
  slice(1:3)
# A tibble: 3 × 4
name year wind_category description
<chr> <dbl> <chr> <chr>
1 Blanche 1975 Category 1 Very dangerous winds
2 Blanche 1975 Category 1 Very dangerous winds
3 Caroline 1975 Category 2 Extremely dangerous winds
Joins with dplyr:
library(tidyverse)
# Create a small custom dataset using tribble
storm_categories <- tribble(
~category, ~description,
"Category 1", "Very dangerous winds",
"Category 2", "Extremely dangerous winds",
"Category 3", "Devastating damage",
"Category 4", "Catastrophic damage",
"Category 5", "High chance of being deadly")
storms <- storms %>% # Add a wind category to the storms dataset for joining
mutate(wind_category = case_when(
wind < 74 ~ "Not a hurricane",
wind >= 74 & wind < 96 ~ "Category 1",
wind >= 96 & wind < 111 ~ "Category 2",
wind >= 111 & wind < 130 ~ "Category 3",
wind >= 130 & wind < 157 ~ "Category 4",
TRUE ~ "Category 5"))
right_join_result <- right_join(storms, storm_categories, by = c("wind_category" = "category"))
right_join_result %>%
  select(name, year, wind_category, description) %>%
  slice(1:3)
# A tibble: 3 × 4
name year wind_category description
<chr> <dbl> <chr> <chr>
1 Blanche 1975 Category 1 Very dangerous winds
2 Blanche 1975 Category 1 Very dangerous winds
3 Caroline 1975 Category 2 Extremely dangerous winds
left_join_result <- left_join(storms, storm_categories, by = c("wind_category" = "category"))
left_join_result %>%
  select(name, year, wind_category, description) %>%
  slice(1:3)
# A tibble: 3 × 4
name year wind_category description
<chr> <dbl> <chr> <chr>
1 Amy 1975 Not a hurricane <NA>
2 Amy 1975 Not a hurricane <NA>
3 Amy 1975 Not a hurricane <NA>
| dplyr | merge | Description |
|---|---|---|
| full_join() | merge(..., all = TRUE) | This performs a full outer join, combining all rows from both datasets. When there’s no match in one dataset, NA values are introduced in the resulting dataset. |
| inner_join() | merge(..., all = FALSE) | This conducts an inner join, returning only the rows with matching values in both datasets. Rows without a corresponding match in either dataset are excluded. |
| left_join() | merge(..., all.x = TRUE) | This performs a left outer join, retaining all rows from the first dataset and matching rows from the second dataset. NA values are filled in where the second dataset has no match. |
| right_join() | merge(..., all.y = TRUE) | This executes a right outer join, keeping all rows from the second dataset and matching rows from the first dataset. Rows in the second dataset without a match in the first dataset are filled with NA in the resulting dataset. |
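The correspondence in the table above can be checked on a small example. This is a sketch only: the two data frames and the helper normalise() are hypothetical and exist just to make the results comparable, because merge() sorts rows by the key and moves the key column first, whereas the dplyr joins keep the original row and column order.

```r
library(dplyr)

storm_winds <- data.frame(
  name = c("Amy", "Blanche", "Caroline"),
  wind_category = c("Not a hurricane", "Category 1", "Category 2")
)
categories <- data.frame(
  wind_category = c("Category 1", "Category 2", "Category 3"),
  description = c("Very dangerous winds", "Extremely dangerous winds", "Devastating damage")
)

# Hypothetical helper: same column order, same row order, default row names
normalise <- function(df) {
  df <- df[order(df$wind_category, df$name), c("wind_category", "name", "description")]
  rownames(df) <- NULL
  df
}

same_full <- isTRUE(all.equal(
  normalise(full_join(storm_winds, categories, by = "wind_category")),
  normalise(merge(storm_winds, categories, by = "wind_category", all = TRUE))
))
same_inner <- isTRUE(all.equal(
  normalise(inner_join(storm_winds, categories, by = "wind_category")),
  normalise(merge(storm_winds, categories, by = "wind_category", all = FALSE))
))
same_left <- isTRUE(all.equal(
  normalise(left_join(storm_winds, categories, by = "wind_category")),
  normalise(merge(storm_winds, categories, by = "wind_category", all.x = TRUE))
))
same_right <- isTRUE(all.equal(
  normalise(right_join(storm_winds, categories, by = "wind_category")),
  normalise(merge(storm_winds, categories, by = "wind_category", all.y = TRUE))
))

c(full = same_full, inner = same_inner, left = same_left, right = same_right)
```

On this data the full join has four rows, the inner join two, and the left and right joins three each.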
tidyr
tidyr is a handy package for data processing and manipulation, allowing us to transform messy data into a structured, tidy format.
Key Solutions:
Handling Messy Data: Streamlines the process of cleaning and organizing data, making it compatible with other Tidyverse packages.
Data Transformation: Provides tools for converting between wide and long formats, handling missing values, and separating or uniting columns.
Some relevant tidyr functions:
pivot_longer(): Transforms data from wide to long format, making it easier to analyze with other Tidyverse tools.
pivot_wider(): Converts data from long to wide format, useful for creating human-readable tables.
separate_wider_delim(): Splits a single column into multiple columns, ideal for unpacking complex fields.
unite(): Combines multiple columns into a single column, simplifying datasets with redundant columns.
drop_na(): Removes rows with missing values, streamlining datasets for analysis.
replace_na(): Substitutes NA values with specified replacements, maintaining data integrity.
*Note that other functions from tidyr might be more relevant depending on our specific needs.
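A short sketch of four of the functions listed above on a tiny dataset. The tibble obs, its values and the column names site, treatment and number are hypothetical; separate_wider_delim() assumes tidyr 1.3.0 or later.

```r
library(tibble)
library(tidyr)

# Hypothetical measurements with one missing value
obs <- tribble(
  ~sample_id, ~soil_temperature,
  "HYY-C-01", 2,
  "HYY-C-02", NA,
  "HYY-D-03", 6
)

# Remove rows with missing values
complete_rows <- drop_na(obs)

# Or replace the missing value with an explicit substitute
filled <- replace_na(obs, list(soil_temperature = 0))

# Split the identifier into three columns at each "-"
split_id <- separate_wider_delim(
  obs, sample_id,
  delim = "-",
  names = c("site", "treatment", "number")
)

# Combine the three columns back into a single identifier
rejoined <- unite(split_id, "sample_id", site, treatment, number, sep = "-")

split_id
rejoined
```

Whether to drop or replace missing values depends on the analysis; replacing NA with 0, as done here, is only meant to show the syntax.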
Using pivot_wider():
library(tidyverse)
storms_wider <- storms %>%
  filter(year >= 2013) %>%
  select(year, status, wind) %>%
  group_by(year, status) %>%
  summarise(max_wind_speed = max(wind), .groups = "drop") %>%
  pivot_wider(names_from = "status", values_from = "max_wind_speed")
storms_wider
# A tibble: 10 × 10
year disturbance extratropical hurricane `other low` subtropical depressio…¹
<dbl> <int> <int> <int> <int> <int>
1 2013 35 45 80 50 30
2 2014 25 65 125 40 NA
3 2015 NA 65 135 55 NA
4 2016 NA 70 145 45 NA
5 2017 40 75 155 45 30
6 2018 45 75 140 40 30
7 2019 50 75 160 80 NA
8 2020 40 75 135 45 30
9 2021 40 70 135 40 NA
10 2022 40 100 140 55 NA
# ℹ abbreviated name: ¹`subtropical depression`
# ℹ 4 more variables: `subtropical storm` <int>, `tropical depression` <int>,
# `tropical storm` <int>, `tropical wave` <int>
Using pivot_longer():
storms_wider %>%
  pivot_longer(cols = disturbance:`tropical wave`, names_to = "status", values_to = "max_wind_speed") %>%
  drop_na()
# A tibble: 73 × 3
year status max_wind_speed
<dbl> <chr> <int>
1 2013 disturbance 35
2 2013 extratropical 45
3 2013 hurricane 80
4 2013 other low 50
5 2013 subtropical depression 30
6 2013 subtropical storm 55
7 2013 tropical depression 30
8 2013 tropical storm 60
9 2014 disturbance 25
10 2014 extratropical 65
# ℹ 63 more rows
It is possible to do the same thing using base R functions outside the tidyverse. However, as mentioned before, the code is less straightforward and more difficult to follow (especially for beginners).
library(tidyverse) # to get the storms dataset
# Filter the dataset for years 2013 and onwards
storms_filtered <- subset(storms, year >= 2013)
# Select only the year, status, and wind columns
storms_selected <- storms_filtered[, c("year", "status", "wind")]
# Aggregate to find the maximum wind speed for each year and status combination
storms_aggregated <- aggregate(wind ~ year + status, data = storms_selected, max)
# Renaming the aggregated column
names(storms_aggregated)[which(names(storms_aggregated) == "wind")] <- "max_wind_speed"
# Reshape the data from long to wide format
storms_wider <- reshape(storms_aggregated, timevar = "status", idvar = "year", direction = "wide")
# Renaming the columns to match the column names as they are recorded in the storms dataset
# (fixed = TRUE treats the dot as a literal character, not as a regular expression wildcard)
colnames(storms_wider) <- gsub("max_wind_speed.", "", colnames(storms_wider), fixed = TRUE)
storms_wider # View the result
year disturbance extratropical hurricane other low subtropical depression
1 2013 35 45 80 50 30
2 2014 25 65 125 40 NA
3 2017 40 75 155 45 30
4 2018 45 75 140 40 30
5 2019 50 75 160 80 NA
6 2020 40 75 135 45 30
7 2021 40 70 135 40 NA
8 2022 40 100 140 55 NA
11 2015 NA 65 135 55 NA
12 2016 NA 70 145 45 NA
subtropical storm tropical depression tropical storm tropical wave
1 55 30 60 NA
2 50 30 60 NA
3 NA 30 60 30
4 50 30 60 40
5 55 30 60 NA
6 60 30 60 NA
7 50 30 60 NA
8 45 30 60 NA
11 50 30 60 NA
12 55 30 60 NA
It is also possible to reshape the data back to long format using packages outside the tidyverse, such as reshape2. Again, the code is less straightforward and more difficult to follow (especially for beginners).
library(reshape2)
# Reshape from wide to long format using melt
storms_longer <- melt(storms_wider, id.vars = "year",
measure.vars = names(storms_wider)[names(storms_wider) != "year"],
variable.name = "status", value.name = "max_wind_speed")
# Drop rows with NA in 'max_wind_speed'
storms_longer <- storms_longer[!is.na(storms_longer$max_wind_speed), ]
# Modify the 'status' column to remove the prefix and keep only the text after the dot
storms_longer$status <- sub(".*\\.", "", storms_longer$status)
# Order the rows by 'year'
storms_longer <- storms_longer[order(storms_longer$year), ]
storms_longer[1:10, ] # View the first 10 rows of the result
year status max_wind_speed
1 2013 disturbance 35
11 2013 extratropical 45
21 2013 hurricane 80
31 2013 other low 50
41 2013 subtropical depression 30
51 2013 subtropical storm 55
61 2013 tropical depression 30
71 2013 tropical storm 60
2 2014 disturbance 25
12 2014 extratropical 65
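For comparison, the wide-to-long step and the NA removal can be folded into one tidyr call. This is a minimal sketch on a made-up data frame; toy_wide, toy_long and their values are illustrative assumptions.

```r
library(tidyr)

# Toy wide table, invented for illustration; check.names = FALSE keeps the space in the name
toy_wide <- data.frame(
  year = c(2013, 2014),
  hurricane = c(80, 125),
  `tropical wave` = c(NA, 40),
  check.names = FALSE
)

# Every column except 'year' is stacked into status / max_wind_speed pairs;
# values_drop_na = TRUE removes the rows that melt() would keep as NA
toy_long <- pivot_longer(
  toy_wide,
  cols = -year,
  names_to = "status",
  values_to = "max_wind_speed",
  values_drop_na = TRUE
)

toy_long
```

Because pivot_longer() walks through the input row by row, the result is already ordered by year here, so no separate ordering step is needed.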
tidyr
Use unite() to combine year, month and day into a single column named date.
# A tibble: 19,537 × 13
name date hour lat long status category wind pressure
<chr> <chr> <dbl> <dbl> <dbl> <fct> <dbl> <int> <int>
1 Amy 1975-6-27 0 27.5 -79 tropical depression NA 25 1013
2 Amy 1975-6-27 6 28.5 -79 tropical depression NA 25 1013
3 Amy 1975-6-27 12 29.5 -79 tropical depression NA 25 1013
4 Amy 1975-6-27 18 30.5 -79 tropical depression NA 25 1013
5 Amy 1975-6-28 0 31.5 -78.8 tropical depression NA 25 1012
6 Amy 1975-6-28 6 32.4 -78.7 tropical depression NA 25 1012
7 Amy 1975-6-28 12 33.3 -78 tropical depression NA 25 1011
8 Amy 1975-6-28 18 34 -77 tropical depression NA 30 1006
9 Amy 1975-6-29 0 34.4 -75.8 tropical storm NA 35 1004
10 Amy 1975-6-29 6 34 -74.8 tropical storm NA 40 1002
# ℹ 19,527 more rows
# ℹ 4 more variables: tropicalstorm_force_diameter <int>,
# hurricane_force_diameter <int>, wind_category <chr>, major_hurricane <chr>
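A call along the following lines would produce a date column like the one printed above. It is a sketch that assumes the storms dataset shipped with dplyr; the object name storms_united is only illustrative.

```r
library(dplyr) # provides the storms dataset and the %>% pipe
library(tidyr) # provides unite()

# Paste year, month and day together with "-" and drop the three source columns
storms_united <- storms %>%
  unite("date", year, month, day, sep = "-")

head(storms_united$date, 3)
```

By default unite() removes the input columns; passing remove = FALSE keeps year, month and day alongside the new date column.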
Use separate_wider_delim() to split the date column back into three columns.
storms %>%
  unite("date", year, month, day, sep = "-") %>%
  separate_wider_delim(cols = date, delim = "-", names = c("year", "month", "day")) %>%
  mutate(category = replace_na(category, 0)) %>%
  filter(category == 0)
# A tibble: 14,734 × 15
name year month day hour lat long status category wind pressure
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <fct> <dbl> <int> <int>
1 Amy 1975 6 27 0 27.5 -79 tropical d… 0 25 1013
2 Amy 1975 6 27 6 28.5 -79 tropical d… 0 25 1013
3 Amy 1975 6 27 12 29.5 -79 tropical d… 0 25 1013
4 Amy 1975 6 27 18 30.5 -79 tropical d… 0 25 1013
5 Amy 1975 6 28 0 31.5 -78.8 tropical d… 0 25 1012
6 Amy 1975 6 28 6 32.4 -78.7 tropical d… 0 25 1012
7 Amy 1975 6 28 12 33.3 -78 tropical d… 0 25 1011
8 Amy 1975 6 28 18 34 -77 tropical d… 0 30 1006
9 Amy 1975 6 29 0 34.4 -75.8 tropical s… 0 35 1004
10 Amy 1975 6 29 6 34 -74.8 tropical s… 0 40 1002
# ℹ 14,724 more rows
# ℹ 4 more variables: tropicalstorm_force_diameter <int>,
# hurricane_force_diameter <int>, wind_category <chr>, major_hurricane <chr>
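Note that year, month and day come back as character columns (<chr>) in the output above, because they were split out of a text column. One way to restore numeric types is sketched below, assuming the storms dataset from dplyr; storms_split is an illustrative name, and as.integer() is one reasonable choice of conversion rather than the only one.

```r
library(dplyr)
library(tidyr)

storms_split <- storms %>%
  unite("date", year, month, day, sep = "-") %>%
  separate_wider_delim(cols = date, delim = "-", names = c("year", "month", "day")) %>%
  # The split pieces are text, so convert them back to whole numbers
  mutate(across(c(year, month, day), as.integer))

# Inspect the column types after the conversion
sapply(storms_split[c("year", "month", "day")], class)
```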
tidyr
The functionality of unite() can also be replicated with base R functions.
library(tidyverse) # to load the storms dataset
# Create 'date' column by concatenating 'year', 'month', and 'day'
storms$date <- with(storms, paste(year, month, day, sep = "-"))
# Remove the 'year', 'month', and 'day' columns
storms <- storms[, !(names(storms) %in% c("year", "month", "day"))]
# Reorder columns to place 'date' after 'name'
cols_order <- c("name", "date", setdiff(names(storms), c("name", "date")))
storms <- storms[, cols_order]
storms[1:10, ] # View the first 10 rows of the data
# A tibble: 10 × 13
name date hour lat long status category wind pressure
<chr> <chr> <dbl> <dbl> <dbl> <fct> <dbl> <int> <int>
1 Amy 1975-6-27 0 27.5 -79 tropical depression NA 25 1013
2 Amy 1975-6-27 6 28.5 -79 tropical depression NA 25 1013
3 Amy 1975-6-27 12 29.5 -79 tropical depression NA 25 1013
4 Amy 1975-6-27 18 30.5 -79 tropical depression NA 25 1013
5 Amy 1975-6-28 0 31.5 -78.8 tropical depression NA 25 1012
6 Amy 1975-6-28 6 32.4 -78.7 tropical depression NA 25 1012
7 Amy 1975-6-28 12 33.3 -78 tropical depression NA 25 1011
8 Amy 1975-6-28 18 34 -77 tropical depression NA 30 1006
9 Amy 1975-6-29 0 34.4 -75.8 tropical storm NA 35 1004
10 Amy 1975-6-29 6 34 -74.8 tropical storm NA 40 1002
# ℹ 4 more variables: tropicalstorm_force_diameter <int>,
# hurricane_force_diameter <int>, wind_category <chr>, major_hurricane <chr>
The functionality of separate_wider_delim() can also be replicated with base R functions.
rm(storms) # Remove the modified copy so that storms refers to the packaged dataset again
library(tidyverse) # to access the storms dataset
# Create 'date' column by concatenating 'year', 'month', and 'day'
storms$date <- with(storms, paste(year, month, day, sep = "-"))
# Split 'date' column into 'year', 'month', and 'day' columns
date_parts <- do.call(rbind, strsplit(storms$date, "-"))
storms$year <- date_parts[, 1]
storms$month <- date_parts[, 2]
storms$day <- date_parts[, 3]
# Replace NA values in 'category' with 0
storms$category[is.na(storms$category)] <- 0
# Filter rows where 'category' is 0
storms_filtered <- storms[storms$category == 0, ]
storms_filtered[1:10, ] # View the first 10 rows of the result
# A tibble: 10 × 14
name year month day hour lat long status category wind pressure
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <fct> <dbl> <int> <int>
1 Amy 1975 6 27 0 27.5 -79 tropical d… 0 25 1013
2 Amy 1975 6 27 6 28.5 -79 tropical d… 0 25 1013
3 Amy 1975 6 27 12 29.5 -79 tropical d… 0 25 1013
4 Amy 1975 6 27 18 30.5 -79 tropical d… 0 25 1013
5 Amy 1975 6 28 0 31.5 -78.8 tropical d… 0 25 1012
6 Amy 1975 6 28 6 32.4 -78.7 tropical d… 0 25 1012
7 Amy 1975 6 28 12 33.3 -78 tropical d… 0 25 1011
8 Amy 1975 6 28 18 34 -77 tropical d… 0 30 1006
9 Amy 1975 6 29 0 34.4 -75.8 tropical s… 0 35 1004
10 Amy 1975 6 29 6 34 -74.8 tropical s… 0 40 1002
# ℹ 3 more variables: tropicalstorm_force_diameter <int>,
# hurricane_force_diameter <int>, date <chr>
base R vs. tidyverse
base R
tidyverse
tibble, a modern take on the data.frame.
dplyr, enabling complex operations with fewer lines of code.
tidyverse core packages
Bear in mind that the remaining core packages are as useful and powerful as the three packages (tibble, dplyr and tidyr) covered in this lecture. Mastering them helps us work more efficiently with the tidyverse suite of packages and create more useful, tidy and clear workflows.
tidyverse core packages
It is important to highlight that the same syntax applies to the remaining tidyverse core packages. It is therefore possible, and recommended, to build modular code; this can even be done for plotting (ggplot2).
tidyverse core packages
# Load the tidyverse package for data manipulation and plotting
library(tidyverse)
storms %>%
  filter(year > 2009) %>% # Filter the data to include only years greater than 2009
  group_by(year) %>% # Group the filtered data by 'year'
  summarise(mean_wind_speed = mean(wind, na.rm = TRUE)) %>% # Calculate the mean wind speed for each year, ignoring NA values
  ggplot(aes(x = year, y = mean_wind_speed)) + # Initialize a ggplot object, mapping 'year' to x-axis and 'mean_wind_speed' to y-axis
  geom_col(fill = "dodgerblue2", col = "black") + # Add a column plot (bar plot) with a blue fill and black border
  coord_flip() + # Flip the coordinates to make the bars horizontal
  labs(title = "Storms", x = "Year", y = "Average wind speed (knots)") + # Add a title and axis labels (storms records wind in knots)
  theme_bw() + # Apply a black-and-white theme for a cleaner look
  theme(text = element_text(size = 14)) # Customize the text size for all text elements in the plot
Tidyverse Main Website: tidyverse
R for Data Science: Online book by Hadley Wickham et al. 2023
Posit YouTube Channel: Video content from the tidyverse developers on their YouTube channel.